Abstract: A novel content-based heterogeneous information
retrieval framework, particularly well suited to browse medical 
databases and support new generation computer aided diagnosis 
(CADx) systems, is presented in this paper. It was designed to 
retrieve possibly incomplete documents, consisting of several 
images and semantic information, from a database; more complex 
data types such as videos can also be included in the framework. 
The proposed retrieval method relies on image processing, in 
order to characterize each individual image in a document by
its digital content, and information fusion. Once the available
images in a query document are characterized, a degree of match, 
between the query document and each reference document stored 
in the database, is defined for each attribute (an image feature 
or a metadata field). A Bayesian network is used to recover missing
information if need be. Finally, two novel information fusion 
methods are proposed to combine these degrees of match, in order 
to rank the reference documents by decreasing relevance for the 
query. In the first method, the degrees of match are fused by the 
Bayesian network itself. In the second method, they are fused 
by the Dezert-Smarandache theory: the second approach lets us 
model our confidence in each source of information (i.e., each 
attribute) and take it into account in the fusion process for a better 
retrieval performance. The proposed methods were applied to two 
heterogeneous medical databases, a diabetic retinopathy database 
and a mammography screening database, for computer aided 
diagnosis. Precisions at five of 0.809 ± 0.158 and 0.821 ± 0.177, 
respectively, were obtained for these two databases, which is very 
promising. 

Index Terms: Diabetic retinopathy, heterogeneous information
retrieval, information fusion, mammography, medical databases. 
I. Introduction

Two main tasks in computer aided diagnosis (CADx)
using medical images are extraction of relevant infor- 
mation from images and combination of the extracted features 
with other sources of information to automatically or semi-au- 
tomatically generate a reliable diagnosis. One promising way 
to achieve the second goal is to take advantage of the growing 
number of digital medical databases either for heterogeneous 
data mining, i.e., for extracting new knowledge, or for het- 
erogeneous information retrieval, i.e., for finding similar 
heterogeneous medical records (e.g., consisting of digital im- 
ages and metadata). This paper presents a generic solution to 
use digital medical databases for heterogeneous information 
retrieval, and solve CADx problems using case-based reasoning 
(CBR) [1].

CBR was introduced in the early 1980s as a new decision 
support tool. It relies on the idea that analogous problems 
have similar solutions. In CBR, interpreting a new situation 
revolves around the retrieval of relevant documents in a case 
database. The knowledge of medical experts is a mixture of 
textbook knowledge and experience through real life clinical 
cases, so the assumption that analogous problems have similar 
solutions makes sense to them. This is the reason why there is 
a growing interest in CBR for the development of medical de- 
cision support systems [2]. Medical CBR systems are intended 
to be used as follows: should a physician be doubtful about 
his/her diagnosis, he/she can send the available data about 
the patient to the system; the system selects and displays the 
most similar documents, along with their associated medical 
interpretations, which may help him/her confirm or invalidate 
his/her diagnosis by analogy. Therefore, the purpose of such 
a system is not to replace physicians’ diagnosis, but rather to 
aid their diagnosis. Medical documents often consist of digital 
information such as images and symbolic information such 
as clinical annotations. In the case of diabetic retinopathy, for 
instance, physicians analyze heterogeneous series of images 
together with contextual information such as the age, sex and 
medical history of the patient. Moreover, medical information is 
sometimes incomplete and uncertain, two problems that require 
a particular attention. As a consequence, original CBR systems, 
designed to process simple documents such as homogeneous 
and comprehensive attribute vectors, are clearly unsuited to 
complex CADx applications. On one hand, some CBR systems 
have been designed to manage symbolic information [3]. On 
the other hand, some others, based on content-based image 
retrieval [4], have been designed to manage digital images
[5]. However, few attempts have been made to merge the two 
kinds of approaches. We consider in this paper a larger class of 
problems: CBR in heterogeneous databases. 

To retrieve heterogeneous information, some simple ap- 
proaches, based on early fusion (i.e., attributes are fused in 
feature space) [6], [7] or late fusion (i.e., attributes are fused 
in semantic space) [8]-[10], have been presented in the
literature. A few application-specific approaches [11]-[15], as well
as a generic retrieval system, based on dissimilarity spaces 
and relevance feedback [16], have also been presented. We 
introduce in this paper a novel generic approach that does not 
require relevance feedback from the user. The proposed system 
is able to manage incomplete information and the aggregation 
of heterogeneous attributes: symbolic and multidimensional 
digital information (we focus on digital images, but the same 
principle can be applied to any n-dimensional signals). The 
proposed approach is based on a Bayesian network and the 
Dezert-Smarandache theory (DSmT) [17]. Bayesian networks 
have been used previously in retrieval systems, either for key- 
word based retrieval [18], [19] or for content-based image or 
video retrieval [20], [21]. The Dezert-Smarandache theory is
increasingly used in remote sensing applications [17];
however, to our knowledge, this is its first medical application.
In our approach, a Bayesian network is used to model the rela- 
tionships between the different attributes (the extracted features 
of each digital image and each contextual information field): we 
associate each attribute with a variable in the Bayesian network. 
It lets us compare incomplete documents: the Bayesian net- 
work is used to estimate the probability of unknown variables 
(associated with missing attributes) knowing the value of other 
variables (associated with available attributes). Information 
coming from each attribute is then used to derive an estima- 
tion of the degree of match between a query document and a 
reference document in the database. Then, these estimations 
are fused; two fusion operators are introduced in this paper for 
this purpose. The first fusion operator is incorporated in the 
Bayesian network: the computation of the degree of match, 
with respect to a given attribute, relies on the design of condi- 
tional probabilities relating this attribute to the overall degree 
of match. An evolution of this fusion operator that models our 
confidence in each source of information (i.e., each attribute) is 
introduced. It is based on the Dezert-Smarandache theory. In 
order to model our confidence in each source of information, 
within this second fusion operator, an uncertainty component is 
included in the belief mass function characterizing the evidence 
coming from this source of information. 

The main advantage of the proposed approach, over stan- 
dard feature selection/feature classification approaches, is that 
a retrieval model is trained separately for each attribute. This 
is useful to process incomplete documents: in the proposed ap- 
proach, we simply combine the models associated with all avail- 
able attributes; as a comparison, a standard classifier relies on 
feature combinations, and therefore may become invalid when 
input feature vectors are incomplete. Also, because each at- 
tribute is processed separately, the curse of dimensionality is 
avoided. Therefore, it is not necessary to select the most rele- 
vant features: instead, we simply weight each feature by a con- 
fidence measure. 




Fig. 1. Examples of Bayesian networks. (a) A chain. (b) A polytree, i.e., a
network in which there is at most one (undirected) path between two nodes. (c)
A network containing a cycle: (A, D, E, C, A).

The paper is organized as follows. Section II presents the pro- 
posed Bayesian network based retrieval. Section III presents the 
Bayesian network and Dezert-Smarandache theory based re- 
trieval. These methods are applied in Section IV to CADx in 
two heterogeneous databases: a diabetic retinopathy database 
and a mammography database. We end with a discussion and a 
conclusion in Section V. 

II. Bayesian Network Based Retrieval 

A. Description of Bayesian Networks 

A Bayesian network [22] is a probabilistic graphical model 
that represents a set of variables and their probabilistic depen- 
dencies. It is a directed acyclic graph whose nodes represent 
variables, and whose edges encode conditional independencies 
between the variables. Examples of Bayesian networks are
given in Fig. 1.

In the example of Fig. 1(b), the edge from the parent node A 
to its child node D indicates that variable A has a direct influ- 
ence on variable D. Each edge in the graph is associated with 
a conditional probability matrix expressing the probability of a 
child variable given one of its parent variables. For instance, if 
$A = \{a_0, a_1\}$ and $D = \{d_0, d_1, d_2\}$, then $A \rightarrow D$ is assigned
the following ($3 \times 2$) conditional probability matrix $P(D|A)$:

$$P(D|A) = \begin{pmatrix}
P(D = d_0 | A = a_0) & P(D = d_0 | A = a_1) \\
P(D = d_1 | A = a_0) & P(D = d_1 | A = a_1) \\
P(D = d_2 | A = a_0) & P(D = d_2 | A = a_1)
\end{pmatrix}. \tag{1}$$

A directed acyclic graph is a Bayesian network relative
to a set of variables $\{X_1, \ldots, X_n\}$ if the joint distribution
$P(X_1, \ldots, X_n)$ can be expressed as in

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)) \tag{2}$$

where $\mathrm{parents}(X)$ is the set of nodes such that $Y \rightarrow X$ is in
the graph $\forall Y \in \mathrm{parents}(X)$. Because a Bayesian network can
completely model the variables and their relationships, it can be 
used to answer queries about them. Typically, it is used to esti- 
mate unknown probabilities for a subset of variables when other 
variables (the evidence variables) are observed. This process 
of computing the posterior distribution of variables, given ev- 
idence, is called probabilistic inference. In Bayesian networks 
containing cycles, exact inference is an NP-hard problem. Approximate
inference algorithms have been proposed, but their
accuracies depend on the network’s structure; therefore, they 
are not general. By transforming the network into a cycle-free 
hypergraph, and performing inference in this hypergraph, Lau- 
ritzen and Spiegelhalter proposed an exact inference algorithm 
with relatively low complexity [23]; this algorithm was used in 
the proposed system. 
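
As an illustration of the factorization (2) and of probabilistic inference, the following minimal sketch builds a hypothetical three-node chain A -> B -> C with binary variables and computes a posterior by brute-force enumeration. The probability values are invented for the example, and enumeration is a deliberately simple stand-in: it is not the Lauritzen and Spiegelhalter algorithm used in the proposed system.

```python
from itertools import product

# Hypothetical chain A -> B -> C with binary variables.
# All probability values are illustrative, not taken from the paper.
P_A = {0: 0.7, 1: 0.3}
P_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # P_B_given_A[a][b]
P_C_given_B = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # P_C_given_B[b][c]


def joint(a, b, c):
    """Joint probability obtained from the factorization (2) for the chain."""
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]


def posterior_a_given_c(c_obs):
    """P(A | C = c_obs), summing the joint over the hidden variable B."""
    unnorm = {a: sum(joint(a, b, c_obs) for b in (0, 1)) for a in (0, 1)}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}


# The factorized joint must sum to one over all configurations.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
# Observing C = 1 (the evidence) updates the belief in A.
post = posterior_a_given_c(1)
```

Enumeration costs time exponential in the number of variables, which is why an algorithm exploiting the graph structure is preferable for networks of realistic size.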

B. Learning a Bayesian Network From Data 

A Bayesian network is defined by a structure and the condi- 
tional probability of each node given its parents in that structure 
(or its prior probability if it does not have any parent). These pa- 
rameters can be learned automatically from data. Defining the 
structure consists in finding pairs of nodes $(X, Y)$ that are directly
dependent, i.e., such that:

• $X$ and $Y$ are not independent ($P(X, Y) \neq P(X)P(Y)$);

• there is no node set $Z$ such that $X$ and $Y$ are independent
given $Z$, i.e., such that $P(X, Y|Z) = P(X|Z)P(Y|Z)$.

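Independence and conditional independence can be assessed by
mutual information [see (3)] and conditional mutual information
[see (4)], respectively

$$I(X, Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)} \tag{3}$$

$$I(X, Y|Z) = \sum_{x, y, z} P(x, y, z) \log \frac{P(x, y|z)}{P(x|z)P(y|z)}. \tag{4}$$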

Two nodes are independent (resp. conditionally independent)
if mutual information (resp. conditional mutual information) is
smaller than a given threshold $\epsilon$, $0 < \epsilon < 1$. Ideally, $\epsilon$ should be
equal to 0. However, in the presence of noise, a null threshold lets
some meaningless edges (links) appear, which also unnecessarily
increases the computation time. To avoid this, in this study, $\epsilon$ was
chosen in advance to be equal to 0.1. This number is independent
of dataset cardinality [24].
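
The two independence tests above can be sketched as follows. This is a minimal illustration, assuming the standard definitions of mutual information and conditional mutual information with natural logarithms (the base is not stated in the text) and joint distributions given as plain dictionaries; the function names and the toy distributions are hypothetical.

```python
import math
from collections import defaultdict

EPSILON = 0.1  # threshold value quoted in the text


def mutual_information(p_xy):
    """I(X, Y) from a joint table {(x, y): probability}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)


def conditional_mutual_information(p_xyz):
    """I(X, Y | Z) from a joint table {(x, y, z): probability}."""
    p_z, p_xz, p_yz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), p in p_xyz.items():
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p
    # P(x,y|z) / (P(x|z) P(y|z)) equals P(x,y,z) P(z) / (P(x,z) P(y,z)).
    return sum(p * math.log(p * p_z[z] / (p_xz[(x, z)] * p_yz[(y, z)]))
               for (x, y, z), p in p_xyz.items() if p > 0)


# Toy case 1: X and Y agree 90% of the time, so they are dependent.
dependent = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

# Toy case 2: X and Y are noisy copies of Z. They are dependent marginally
# but independent given Z, so no direct edge X - Y should be kept.
common_cause = {(x, y, z): 0.5 * (0.9 if x == z else 0.1) * (0.9 if y == z else 0.1)
                for x in (0, 1) for y in (0, 1) for z in (0, 1)}
xy_marginal = defaultdict(float)
for (x, y, z), p in common_cause.items():
    xy_marginal[(x, y)] += p

keep_edge = (mutual_information(xy_marginal) >= EPSILON
             and conditional_mutual_information(common_cause) >= EPSILON)
```

In the second toy case the pair passes the first test but fails the second, so `keep_edge` is false: the dependence between X and Y is fully explained by Z.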

The structure of the Bayesian network, as well as edge orientation,
was obtained by Cheng’s algorithm [24]. This algorithm
was chosen for its complexity: complexity is polynomial in the 
number of variables, as opposed to exponential in competing al- 
gorithms. 

C. Including Images in a Bayesian Network 

Contextual information is included as usual in a Bayesian
network: a variable with a finite set of states, one for each possible
attribute value, is defined for each field.

To include images in a Bayesian network, we first define a 
variable for each image in the sequence. For each “image variable,”
we follow the usual steps of Content-Based Image Retrieval
(CBIR) [4]: 1) building a signature for each image (i.e.,
extracting a feature vector summarizing its digital content),
and 2) defining a distance measure between two signatures (see 
Section II-C-1). Thus, measuring the distance between two im- 
ages comes down to measuring the distance between two sig- 
natures. Similarly, in a Bayesian network, defining states for an 
“image variable” comes down to defining states for the signa- 
ture of the corresponding images. To this aim, similar image 
signatures are clustered, as described below, and each cluster is
associated with a state. Thanks to this process, image signatures
can be included in a Bayesian network like any other variable. 

1) Image Signature and Distance Measure: In previous
works on CBIR, we proposed to extract a signature for images 
from their wavelet transform [25]. These signatures model the 
distribution of the wavelet coefficients in each subband of the 
decomposition; as a consequence they provide a multiscale 
description of images. To characterize the wavelet coeffi- 
cient distribution in a given subband, Wouwer’s work was 
applied [26]: Wouwer has shown that this distribution can be 
modeled by a generalized Gaussian function. The maximum 
likelihood estimators of the wavelet coefficient distribution 
in each subband are used as a signature. These estimators 
can be computed directly from wavelet-based compressed 
images (such as JPEG-2000 compressed images), which can 
be useful when a large number of images has to be processed. 
A simplified version of Do’s generalized Gaussian parameter 
estimation method [25], [27] is proposed in Appendix A to 
reduce computation times. Any wavelet basis can be used to 
decompose images. However, the effectiveness of the extracted 
signatures largely depends on the choice of this basis. For this 
reason, we proposed to search for an optimal wavelet basis [25] 
within the lifting scheme framework, which is implemented 
in the compression standards. To compare two signatures, Do
proposed the use of the Kullback-Leibler divergence between
wavelet coefficient distributions $P$ and $Q$ in two subbands [27]

$$D(P\|Q) = \int_{\mathbb{R}} p(x) \log \frac{p(x)}{q(x)} \, dx \tag{5}$$

where $p$ and $q$ are the densities of $P$ and $Q$, respectively. A symmetric
version of the Kullback-Leibler divergence was used,
since clustering algorithms require (symmetric) distance measures

$$\frac{1}{2}\left(D(P\|Q) + D(Q\|P)\right). \tag{6}$$

Finally, the distance between two images is defined as a 
weighted sum of these distances over the subbands, noted
WSD; weights are tuned by a genetic algorithm to maximize
retrieval performance on the training set [25]. The ability to 
select a weight vector and a wavelet basis makes this image 
representation highly tunable. We have shown in previous 
works the superiority of the proposed image signature, in 
terms of retrieval performance, over several well-known image 
signatures [25]. 
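
The distance computation can be sketched as follows, assuming each subband is summarized by the two parameters (alpha, beta) of a zero-mean generalized Gaussian density p(x) = beta / (2 alpha Gamma(1/beta)) exp(-(|x|/alpha)^beta). For this family the Kullback-Leibler divergence has a closed form, which is used here instead of numerical integration. The parameterization, the function names, the signature values and the weights are assumptions made for the example; in the paper the weights are tuned by a genetic algorithm.

```python
import math


def kl_ggd(alpha1, beta1, alpha2, beta2):
    """Closed-form D(P||Q) between two zero-mean generalized Gaussians."""
    log_term = (math.log(beta1 * alpha2) + math.lgamma(1.0 / beta2)
                - math.log(beta2 * alpha1) - math.lgamma(1.0 / beta1))
    # Expectation under P of (|x| / alpha2) ** beta2.
    moment = ((alpha1 / alpha2) ** beta2
              * math.exp(math.lgamma((beta2 + 1.0) / beta1)
                         - math.lgamma(1.0 / beta1)))
    return log_term + moment - 1.0 / beta1


def symmetric_kl(sig1, sig2):
    """Symmetrized divergence for one subband; each sig is (alpha, beta)."""
    return 0.5 * (kl_ggd(*sig1, *sig2) + kl_ggd(*sig2, *sig1))


def wsd(signature1, signature2, weights):
    """Weighted sum of the per-subband symmetric divergences."""
    return sum(w * symmetric_kl(s1, s2)
               for w, s1, s2 in zip(weights, signature1, signature2))


# Hypothetical two-subband signatures and placeholder weights.
image_a = [(1.0, 2.0), (0.5, 1.0)]
image_b = [(2.0, 2.0), (0.7, 1.2)]
weights = [0.7, 0.3]
distance = wsd(image_a, image_b, weights)
```

With beta = 2 the density is Gaussian, so the closed form can be checked against the familiar divergence between two zero-mean normal distributions.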

Fig. 2. Bayesian network based retrieval. Solid-lined arrows mean “leads to”
or “is followed by” and dashed-lined arrows mean “is used by.”

2) Signature Clustering: In order to define several states for
an “image variable,” similar images are clustered with an unsupervised
classification algorithm, thanks to the image signatures
and the associated distance measure above. Any algorithm
can be used, provided that the distance measure can be specified.
We chose the well-known fuzzy C-means algorithm (FCM)
[28] and replaced the Euclidean distance by WSD described
above. In this algorithm, each document is assigned to each
cluster $k = 1 \ldots K$ with a fuzzy membership $0 \leq u_k \leq 1$,
such that $\sum_{k=1}^{K} u_k = 1$, which can be interpreted as a probability.
Finding the right number of clusters is generally a difficult
problem. However, when each sample has been assigned a class
label, mutual information between clusters and class labels can
be used to determine the optimal number of clusters $K$ [29] [see
(7)]

$$K = \arg\max_{K'} \sum_{c=1}^{C} \sum_{k=1}^{K'} P(c, k) \log_{C+K'} \frac{P(c, k)}{P(c)P(k)} \tag{7}$$

where $c = 1 \ldots C$ are the class labels, $P(c, k)$ is the joint probability
distribution function of the class and cluster labels, $P(c)$
and $P(k)$ are the marginal probability distribution functions.
Other continuous variables can be discretized similarly: the age
of a person, 1-D signals, videos, etc.
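
The fuzzy memberships mentioned above can be illustrated with the standard fuzzy C-means membership update, applied to the distances between one sample and each cluster center. This is a minimal sketch: the fuzzifier value m = 2 is a common default rather than a value reported in the text, and the distances are placeholders standing in for WSD values.

```python
def fuzzy_memberships(distances, m=2.0):
    """Fuzzy C-means memberships of one sample, given its distance to each
    cluster center. The fuzzifier m must be greater than 1."""
    # A sample lying exactly on one or more centers is shared among them.
    zeros = [d == 0.0 for d in distances]
    if any(zeros):
        n = sum(zeros)
        return [1.0 / n if z else 0.0 for z in zeros]
    # u_k = 1 / sum_j (d_k / d_j) ** (2 / (m - 1)), written in normalized form.
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


# Hypothetical distances from one image signature to K = 3 cluster centers.
u = fuzzy_memberships([0.5, 1.0, 2.0])
```

The memberships sum to one and decrease with distance, so they can be read as the probabilities of the states of the corresponding “image variable.”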

D. System Design 

Let $x_q$ be a query document and $M$ be the number of attributes.

Definition: A document $x$ is said to be relevant for $x_q$ if $x$
and $x_q$ belong to the same class.

To assess the relevance of each reference document in a database
for $x_q$, we define a Bayesian network with the following
variables:

• a set of variables $\{A_i, i = 1 \ldots M\}$, where $A_i$ represents
the $i$th attribute of $x$;

• a Boolean variable $Q$ = “$x$ is relevant for $x_q$” ($\bar{Q}$ =
“$x$ is not relevant for $x_q$”).

Fig. 3. Retrieval Bayesian network (built for the database presented in
Section IV-A). In the example of (b), attributes $A_1, \ldots, A_6$, $A_8$, $A_{10}$, $A_{13}$,
$A_{14}$, $A_{15}$, $A_{17}$, $A_{18}$, $A_{22}$, $A_{23}$ are available for the query document $x_q$, so
the associated nodes are connected to node $Q$. (a) Intermediate network.
(b) Query-specific network.

The design of the system is described hereafter and illustrated
in Fig. 2. To build the network, the first step is to learn the different
relationships between the attributes $\{A_i, i = 1 \ldots M\}$.
So, an intermediate network is built from data, using Cheng’s
algorithm (see Section II-B). For that purpose, the studied database
is divided into a training dataset and a test dataset. Cheng’s
algorithm is applied to the training dataset. In our experiments,
the query document $x_q$ belongs to the test dataset and $x$ belongs
to the training dataset. To build this Bayesian network, a finite
number of states is defined for each variable $A_i$, $i = 1 \ldots M$.
To learn the relationships between these variables, we use the
membership degree of any document $y$ in the training dataset to
each state $a_{ij}$ of each variable $A_i$, noted $\alpha_{ij}(y)$. If $A_i$ is a nominal
variable, $\alpha_{ij}(y)$ is Boolean; for instance, if $y$ is a male then
$\alpha_{\text{“sex”},\text{“male”}}(y) = 1$ and $\alpha_{\text{“sex”},\text{“female”}}(y) = 0$. If $A_i$ is
a continuous variable (such as an image-based feature), $\alpha_{ik}(y)$
is the fuzzy membership of $y$ to each cluster $k = 1 \ldots K$ (see
Section II-C-2). An example of intermediate network is given
in Fig. 3(a).

$Q$ is then integrated in the network. For retrieval, the attributes
of $x$ are observable evidence for $Q$; as a consequence, the associated
variables should be descendants of $Q$. In the retrieval
network, the probabilistic dependences between $Q$ and each
variable $A_i$ depend on $x_q$. In fact, $x_q$ specifies which attributes
should be found in the retrieved documents in order to meet the
user’s needs. So, when the $i$th attribute of $x_q$ is available, we
connect the two nodes $Q$ and $A_i$ and we estimate the associated
conditional probability matrix $P_q(A_i = a_{ij}|Q)$ according to $x_q$
[see Fig. 3(b)]. The index $q$ denotes that the probability depends
on $x_q$. A query-specific network is obtained: its structure depends
on which attributes are available for the query document
and the conditional probability matrices depend on the value
taken for these available attributes by the query document. This
network is used to assess the relevance of any reference document
for $x_q$.

E. Computing the Conditional Probabilities $P_q(A_i = a_{ij}|Q)$

To compute $P_q(A_i = a_{ij}|Q)$, we first estimate $P_q(Q|A_i = a_{ij})$:
the probability that a reference document $x$, with full membership
to the state $a_{ij}$ of attribute $A_i$, is relevant. $P_q(A_i = a_{ij}|Q)$
can then be computed thanks to Bayes’ theorem [see
(8)]. The prior probability $P_q(Q)$ is required; it can be estimated
by the probability that two documents belong to the same class,
$m(\emptyset) = 0$ and $\sum_{A \in 2^\theta} m(A) = 1$. Belief masses let us express
our uncertainty; it is possible for instance to define confidence
intervals on probabilities: depending on external circumstances,
the probability of $Q$ can range from $m(Q)$ to
$m(Q) + m(Q \cup \bar{Q})$. DSmT goes one step further: a (generalized)
belief mass $m(A)$ is assigned to each element $A$ of the
hyper-power set $D(\theta) = \{\emptyset, Q, \bar{Q}, Q \cap \bar{Q}, Q \cup \bar{Q}\}$, i.e., the set
of all composite propositions built from elements of $\theta$ with $\cap$
and $\cup$ operators, such that $m(\emptyset) = 0$ and $\sum_{A \in D(\theta)} m(A) = 1$.

The belief mass functions $m_i$ must first be specified by the
user for each source of information, $i = 1 \ldots M$ (the $m_i$ functions
used in our system are described below, in Paragraph III-C).
Then, the mass functions $m_i$ are fused into the global mass function
$m_f$, according to a given rule of combination. Another difference
between DST and DSmT comes from the underlying rules
of combination. Several rules, designed to better manage conflicts
between sources, were proposed in DSmT, including the
hybrid rule of combination [17] and the proportional conflict
redistribution (PCR) rules [31]. It is possible to introduce constraints
in the model [17]: we can specify pairs of incompatible
hypotheses $(\theta_a, \theta_b)$, i.e., each subset $A$ of $\theta_a \cap \theta_b$ must have a
null mass, noted $A \in C(\theta)$.

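Once the fused mass function $m_f$ has been computed, we
can compute the belief (credibility) and the plausibility of each
hypothesis $A$ (or any other element of $D(\theta)$) as follows:

$$\mathrm{Bel}(A) = \sum_{B_i \subseteq A,\; B_i \in D(\theta)} m_f(B_i) \tag{14}$$

$$\mathrm{Pl}(A) = \sum_{B_i \cap A \notin C(\theta) \cup \{\emptyset\},\; B_i \in D(\theta)} m_f(B_i) = 1 - \mathrm{Bel}(\bar{A}). \tag{15}$$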

Belief and plausibility are respectively pessimistic and opti- 
mistic. Pignistic probability [32], a possible compromise, is 
used instead (see below, in Paragraph III-D); other probabilistic 
transformations are available [33]. 
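
For a frame with two exclusive hypotheses, Q and its negation, these three quantities reduce to simple sums. The sketch below assumes that no mass is given to the empty set or to the intersection of Q and its negation, and that the pignistic transformation splits the mass of the union evenly between the two hypotheses; the function name and the mass values are illustrative.

```python
def belief_plausibility_pignistic(m_q, m_not_q, m_unknown):
    """Return (Bel, Pl, BetP) of Q for masses on Q, not-Q and Q-or-not-Q."""
    total = m_q + m_not_q + m_unknown
    if abs(total - 1.0) > 1e-9:
        raise ValueError("masses must sum to 1")
    bel = m_q                    # only Q itself is included in Q
    pl = m_q + m_unknown         # every element compatible with Q
    bet = m_q + m_unknown / 2.0  # ignorance shared equally by Q and not-Q
    return bel, pl, bet


# Illustrative fused masses: 0.5 for Q, 0.2 against, 0.3 undecided.
bel, pl, bet = belief_plausibility_pignistic(0.5, 0.2, 0.3)
```

Belief and plausibility bracket the pignistic probability, which is why the latter is a natural compromise for ranking.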

B. Link With Bayesian Network Based Retrieval 

Our motivation for using the theory of belief functions, 
instead of the Bayesian theory, is that the former lets us model 
our confidence in each source of information, instead of taking 
each piece of information at face value. This property is partic- 
ularly attractive for a medical decision support system where 
heterogeneous sources of information, with varying reliability, 
are combined. Because its fusion operators better manage 
conflicting sources of information, a common occurrence when 
these sources are unreliable, DSmT was used instead of the 
original theory of belief functions. 

In the Bayesian network based method (see Section II), the
relevance of a reference document for the query, according
to a given attribute $A_i$, has been estimated through the design
of conditional probabilities $P_q(Q|A_i = a_{ij})$. The $M$
sources of information (represented by the network variables
$A_i$, $i = 1 \ldots M$) were then fused by the Bayesian network
inference algorithm [see Fig. 3(b)] to compute the posterior
probability of $Q$, $P_q(Q|x)$, for a document $x$ in the database.
We can translate this Bayesian fusion problem into the framework
of the belief mass theory. Let $\theta = \{Q, \bar{Q}\}$ be the frame of
discernment. For each source $i$ ($A_i$), we defined [see (13)] a degree
of match $dm_i(x, x_q)$ between $x$ and the query $x_q$, which may
be viewed as the belief mass $m_i(Q)$ assigned to hypothesis $Q$;
consequently, $m_i(\bar{Q}) = 1 - m_i(Q)$ was assigned to $\bar{Q}$.

In that first approach, we did not model our confidence in the
estimation of the relevance provided by each source of evidence
(through the design of conditional probabilities). Poor estimations
of the relevance provided by some sources might therefore
mislead the computation of the fused estimation, so we would like
to give more importance in the fusion process to the trusted
sources of evidence. We propose to use DSmT to model our
confidence in each source of evidence, as explained below.

C. System Design 

To extend the previous method in the DSmT framework, we
assign a mass not only to $Q$ and $\bar{Q}$, but to each element in
$D(\theta) = \{\emptyset, Q, \bar{Q}, Q \cap \bar{Q}, Q \cup \bar{Q}\}$. Assigning a mass to $Q \cap \bar{Q}$
is meaningless, so we only assign a mass to elements in
$D(\theta) \setminus \{Q \cap \bar{Q}\} = \{\emptyset, Q, \bar{Q}, Q \cup \bar{Q}\} = 2^\theta$ (it is actually Shafer’s model
[30]).

To compute the belief masses for a given source of information
$i$, we defined a test $T_i$ on the degree of match $dm_i$:
$T_i(x, x_q)$ is true if $dm_i(x, x_q) \geq \tau_i$, $0 < \tau_i < 1$, and false
otherwise. The mass functions are then assigned according to
$T_i(x, x_q)$.

• if $T_i(x, x_q)$ is true:

  - $m_i(Q) = P(T_i(x, x_q) \mid x \text{ is relevant for } x_q)$ (the sensitivity of $T_i$)

  - $m_i(Q \cup \bar{Q}) = 1 - m_i(Q)$

  - $m_i(\bar{Q}) = 0$

• else

  - $m_i(\bar{Q}) = P(\overline{T_i(x, x_q)} \mid x \text{ is not relevant for } x_q)$ (the specificity of $T_i$)

  - $m_i(Q \cup \bar{Q}) = 1 - m_i(\bar{Q})$

  - $m_i(Q) = 0$.

The sensitivity (resp. the specificity) represents the degree
of confidence in a positive (resp. negative) answer to test
$T_i$; $m_i(Q \cup \bar{Q})$ is assigned the degree of uncertainty. The
sensitivity of $T_i$, for a given threshold $\tau_i$, is defined as the
percentage of pairs of training documents $(y_1, y_2)$ from the
same class such that $T_i(y_1, y_2)$ is true. Similarly, the specificity
of $T_i$ is defined as the percentage of pairs of training documents
$(z_1, z_2)$ from different classes such that $T_i(z_1, z_2)$ is false. Test
$T_i$ is relevant if it is both sensitive and specific. As $\tau_i$ increases,
fewer pairs pass the test, so sensitivity decreases and specificity
increases. So, we set $\tau_i$ as the intersection of the two curves
“sensitivity according to $\tau_i$” and “specificity according to $\tau_i$.”
A binary search is used to find the optimal $\tau_i$.
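
The threshold selection and the mass assignment for one source of information can be sketched as follows. The example assumes that the degrees of match of same-class and different-class training pairs have already been computed; the two lists of values, the function names and the number of bisection steps are hypothetical.

```python
def sensitivity(threshold, same_class_dm):
    """Fraction of same-class pairs whose degree of match passes the test."""
    return sum(d >= threshold for d in same_class_dm) / len(same_class_dm)


def specificity(threshold, diff_class_dm):
    """Fraction of different-class pairs whose degree of match fails the test."""
    return sum(d < threshold for d in diff_class_dm) / len(diff_class_dm)


def find_threshold(same_class_dm, diff_class_dm, iterations=30):
    """Binary search on [0, 1] for the crossing of the two curves."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if sensitivity(mid, same_class_dm) > specificity(mid, diff_class_dm):
            lo = mid  # the test is still too permissive: raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2.0


def belief_masses(dm, threshold, sens, spec):
    """Masses (m(Q), m(not Q), m(Q or not Q)) for one source of information."""
    if dm >= threshold:
        return sens, 0.0, 1.0 - sens
    return 0.0, spec, 1.0 - spec


# Illustrative degrees of match for training pairs.
same_class = [0.9, 0.8, 0.7, 0.6, 0.4]
diff_class = [0.5, 0.3, 0.2, 0.1, 0.65]
tau = find_threshold(same_class, diff_class)
sens = sensitivity(tau, same_class)
spec = specificity(tau, diff_class)
masses_match = belief_masses(0.7, tau, sens, spec)
masses_no_match = belief_masses(0.3, tau, sens, spec)
```

A source whose test is neither sensitive nor specific ends up with most of its mass on the union of the two hypotheses, so it weighs little in the fusion.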

D. Retrieval Process 

To process a reference document $x$, every available attribute
of $x$ is processed as evidence and Lauritzen and Spiegelhalter’s
inference algorithm is used to estimate $\alpha_{ij}(x)$ $\forall j$, $i = 1 \ldots M$.
If the $i$th attribute of $x_q$ is available, the degree of match
$dm_i(x, x_q)$ is computed according to $\alpha_{ij}(x)$ [see (13)] and
the belief masses are computed according to test $T_i(x, x_q)$.
The sources available for $x_q$ are then fused. Usual rules of
combination have a time complexity exponential in $M$, which
might be a limitation. So we proposed a rule of combination
for two-hypotheses problems ($Q$ and $\bar{Q}$ in our application),
adapted from the PCR rules, with a time complexity polynomial
in $M$ [34]. Once the sources available for $x_q$ are fused
by the proposed rule of combination, the pignistic probability
$betP(Q)$ is computed by the following:

$$betP(Q) = m_f(Q) + \frac{m_f(Q \cup \bar{Q})}{2}. \tag{16}$$

The process is illustrated in Fig. 4(b) and Fig. 5. The reference
documents are then ranked in decreasing order of $betP(Q)$.

Fig. 5. Bayesian network and Dezert-Smarandache based retrieval.

IV. Application to Medical Image Databases

The proposed method has been applied to CADx on two
heterogeneous databases. First, it has been applied to diabetic
retinopathy severity assessment on a dataset (DRD) built at the
Inserm U650 laboratory, in collaboration with ophthalmologists
of Brest University Hospital. Then, it has been applied to breast
cancer screening on a public access database (DDSM).

A. Diabetic Retinopathy Database

The diabetic retinopathy database contains retinal images of
diabetic patients, with associated anonymized information on
the pathology. Diabetes is a metabolic disorder characterized
by sustained inappropriately high blood sugar levels. This
progressively affects blood vessels in many organs, which may lead
to serious renal, cardiovascular, cerebral, and also retinal
complications. The latter case, namely diabetic retinopathy, can lead
to blindness. The database consists of 67 patient files containing
1112 photographs altogether. Images have a definition of 1280
pixels/line for 1008 lines/image. They are losslessly compressed
images. Patients have been recruited at Brest University Hospital
(France) since June 2003 and images were acquired by experts
using a Topcon Retinal Digital Camera (TRC-50IA) connected
to a computer. An image series is given in Fig. 6.

Fig. 6. Photograph sequence of a patient eye. Images (a)-(c) are photographs
obtained with different color filters. Images (d)-(j) constitute a temporal
angiographic series: a contrast agent (fluorescein) is injected and photographs are
taken at different stages [early (d), intermediate (e)-(i), late (j)]. At the
intermediate stage, photographs from the periphery of the retina are available.

The contextual information available is the age and sex of the
patient, as well as structured medical information (see Table I).
Patient records consist of at most 10 images per eye (see Fig. 6)
and 13 contextual attributes; 12.1% of these images and 40.5%
of these contextual attribute values are missing. The disease
severity level, according to ICDRS classification [35], was assessed
by a single expert for all 67 patients: because of intra-observer
variability, the reference standard is imperfect. The distribution
of the disease severity among the above-mentioned 67
patients is given in Table II.

B. Digital Database for Screening Mammography (DDSM)

The DDSM project [36], involving the Massachusetts Gen- 
eral Hospital, the University of South Florida and the Sandia 
National laboratories, has built a mammographic image data- 
base for research on breast cancer screening. It consists of 2277 
patient files. Each of them includes two images of each breast, 
associated with patient information (age at time of study, sub- 
tlety rating for abnormalities, American College of Radiology 
breast density rating and keyword description of abnormalities) 
and image information (scanner, spatial resolution, etc.). The 
following contextual attributes are used in this study: 

• the age at time of study; 

• the breast density rating. 

Images have a varying definition, of about 2000 pixels/line for 
5000 lines/image. An example of image sequence is given in 
Fig. 7. There is no missing information in DDSM. 

Each patient file has been graded by a physician. Patients are 
then classified in three groups: normal, benign and cancer. The 
distribution of grades among the patients is given in Table II. 
The reference standard is also affected by intra- and inter-ob- 
server variability in this dataset. 

C. Objective of the System 

Definition: Let $x_q$ be a query document, and $x_1, x_2, \ldots, x_{\mathcal{K}}$
be its $\mathcal{K}$ most similar documents within the training set. The
precision at $\mathcal{K}$ for $x_q$ is the fraction of documents, among
$\{x_1, x_2, \ldots, x_{\mathcal{K}}\}$, that belong to the same class as $x_q$.
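
The definition above translates directly into code. In this minimal sketch the retrieved documents are represented only by their class labels, listed in ranking order; the function name and the labels are illustrative.

```python
def precision_at_k(query_class, ranked_classes, k=5):
    """Fraction of the k top-ranked reference documents sharing the query class."""
    top = ranked_classes[:k]
    if not top:
        raise ValueError("at least one retrieved document is required")
    return sum(c == query_class for c in top) / len(top)


# Hypothetical ranking returned for a query graded "moderate".
ranked = ["moderate", "moderate", "mild", "moderate", "severe", "moderate"]
p5 = precision_at_k("moderate", ranked, k=5)
```

Three of the five top-ranked documents share the class of the query, so the precision at five is 0.6 for this ranking.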

For each query document, we want to retrieve the most similar
reference documents in a given database. Satisfaction of the
user’s needs can thus be assessed by the precision at $\mathcal{K}$. The
average precision at $\mathcal{K}$ measures how good a fusion method is
TABLE I

Structured Contextual Information for Diabetic Retinopathy Patients

| category | attributes | possible values |
|---|---|---|
| general clinical context | family clinical context | diabetes, glaucoma, blindness, misc. |
| general clinical context | medical clinical context | arterial hypertension, dyslipidemia, proteinuria, renal dialysis, allergy, misc. |
| general clinical context | surgical clinical context | cardiovascular, pancreas transplant, renal transplant, misc. |
| general clinical context | ophthalmologic clinical context | cataract, myopia, AMD, glaucoma, unclear medium, cataract surgery, glaucoma surgery, misc. |
| examination and diabetes context | diabetes type | none, type I, type II |
| examination and diabetes context | diabetes duration | < 1 year, 1 to 5 years, 5 to 10 years, > 10 years |
| examination and diabetes context | diabetes stability | good, bad, fast modifications, glycosylated hemoglobin |
| examination and diabetes context | treatments | insulin injection, insulin pump, anti-diabetic drug + insulin, anti-diabetic drug, pancreas transplant |
| eye symptoms reported before the angiography test | ophthalmologically symptomatic | none, systematic ophthalmologic screening - known diabetes, recently diagnosed diabetes by check-up, diabetic diseases other than ophthalmic ones |
| eye symptoms reported before the angiography test | ophthalmologically asymptomatic | none, infection, unilateral decreased visual acuity (DVA), bilateral DVA, neovascular glaucoma, intra-retinal hemorrhage, retinal detachment, misc. |
| maculopathy | maculopathy | focal edema, diffuse edema, none, ischemic |


TABLE II

Patient Disease Severity Distribution

| database | disease severity | number of patients |
|---|---|---|
| DRD | no apparent diabetic retinopathy | 7 |
| DRD | mild non-proliferative | 9 |
| DRD | moderate non-proliferative | 22 |
| DRD | severe non-proliferative | 9 |
| DRD | proliferative | 9 |
| DRD | treated/non active diabetic retinopathy | 11 |
| DDSM | normal | 695 |
| DDSM | benign | 669 |
| DDSM | cancer | 913 |


Fig. 7. Mammographic image sequence of the same patient. (a) and (b) Two
views of the left breast. (c) and (d) Two views of the right one.



at combining feature-specific distance measures into a
semantically meaningful distance measure.

D. Patient File Features

In those databases, each patient file consists of both digital
images and contextual information. Contextual attributes (13 in
DRD, 2 in DDSM) are processed as-is in the CBR system. Images need
to be processed in order to extract relevant digital features. A
possible solution is to segment these images and extract
domain-specific information (such as the number of lesions); for
DRD, the number of automatically detected microaneurysms (the most
frequent lesion of diabetic retinopathy) [37] is used. However,
this kind of approach requires expert knowledge and a robust
segmentation of images, which is not always possible because of
acquisition variability. So, an additional solution to characterize
images by their digital content, without segmenting images, is
proposed: a feature vector is extracted from the wavelet
decomposition of the image [25]. An image signature is computed for
each image field in a document (4 in DDSM: RCC, RMLO, LCC, LMLO
and 10 in DRD); each image signature is associated with an
attribute (see Section II-C). In total, there are 24 attributes in
DRD and six attributes in DDSM.

E. Training and Test Sets 

Retrieval performance is assessed as follows. Both datasets
are randomly divided into five subsets $V_1, \ldots, V_5$ of equal
size. Each subset $V_i$, $i = 1 \ldots 5$, is used in turn as test
set while the remaining four subsets are used for training the
retrieval system. Note that the test set is completely independent
from the training process.
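
The protocol above can be sketched as follows. This is only an illustration under stated assumptions: documents are identified by integer indices, the shuffle seed and the function name are hypothetical, and fold sizes differ by at most one document when the dataset size is not a multiple of five.

```python
import random
from typing import Iterator, List, Tuple


def five_fold_splits(n_documents: int, n_folds: int = 5,
                     seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one per subset V_i."""
    indices = list(range(n_documents))
    random.Random(seed).shuffle(indices)  # random division of the dataset
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test


# 67 is the sum of the DRD rows of Table II, used here only as an example size.
for train, test in five_fold_splits(67):
    assert not set(train) & set(test)  # test documents never appear in training
    print(len(train), len(test))
```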

F. Results 

The number of documents proposed by the system is typically
set to $K \in \{5, 10, 20\}$. Precisions obtained with each fusion
method are reported in Table III. Because the cardinality of each
class is small in DRD, performance was expected to decrease as $K$
increases. For both databases, at $K = 5$, the average precision is
greater than 0.8; it means that, on average, more than 80% of the
selected documents are relevant for a query. We can see that, on
DRD, the use of DSmT increases the average precision at $K = 5$ by
about 0.10 (from 0.704 to 0.809), but not on DDSM (0.803 compared
to 0.821). This can be explained by the fact that, on DRD, many
sources of information are contextual: less reliable similarity
measures are derived from these contextual sources (the
sensitivity/specificity values of the corresponding tests are
lower), hence the interest of DSmT for this database. To assess the
performance of the proposed fusion framework, independently of the
underlying image signatures (described in Section II-C-1), it was
compared to an early fusion method [6] and a late fusion method [8]
based on the same image signatures. The results we obtained for
these methods are summarized in Table III.

The average computation time to retrieve the five closest
documents for the second method is given in Table IV (computation
times are similar with the first method). Clearly, most of the time
is spent during the computation of image signatures. All
experiments were conducted using an AMD Athlon 64-bit based
computer running at 2 GHz.
TABLE III

Precision Obtained With Different Methods

| Method | DRD, $K = 5$ | DRD, $K = 10$ | DRD, $K = 20$ | DDSM, $K = 5$ | DDSM, $K = 10$ | DDSM, $K = 20$ |
|---|---|---|---|---|---|---|
| Bayesian network (see Section II) | 0.704±0.168 | 0.654±0.174 | 0.551±0.191 | 0.821±0.177 | 0.813±0.179 | 0.798±0.191 |
| Bayesian network + DSmT (see Section III) | 0.809±0.158 | 0.693±0.165 | 0.590±0.180 | 0.803±0.182 | 0.801±0.185 | 0.787±0.188 |
| Bayesian network + DSmT (simplified signature computation) | 0.806±0.158 | 0.693±0.165 | 0.587±0.180 | 0.800±0.184 | 0.799±0.186 | 0.787±0.189 |
| Bayesian network + DSmT (images only) | 0.704±0.176 | 0.640±0.181 | 0.529±0.200 | 0.759±0.192 | 0.740±0.194 | 0.725±0.194 |
| Early fusion [6] | 0.430±0.207 | 0.448±0.203 | 0.432±0.212 | 0.714±0.193 | 0.731±0.192 | 0.718±0.196 |
| Late fusion [8] | 0.394±0.210 | 0.431±0.194 | 0.427±0.204 | 0.703±0.192 | 0.717±0.191 | 0.700±0.200 |



TABLE IV

Computation Times for the DSmT Based Method

| estimation method | operation | DRD | DDSM |
|---|---|---|---|
| - | retrieval (once signatures are computed) | 0.37 s | 4.67 s |
| Do’s generalized Gaussian estimation method | computing the signatures (for 1 image) | 4.57 s | 35.89 s |
| Do’s generalized Gaussian estimation method | average retrieval time (the average number of images per document is ≈ 9 for DRD and 4 for DDSM) | 40.58 s | 148.27 s |
| Simplified generalized Gaussian estimation method (see Appendix A) | computing the signatures (for 1 image) | 0.25 s | 2.23 s |
| Simplified generalized Gaussian estimation method (see Appendix A) | average retrieval time | 2.58 s | 13.59 s |



Fig. 8. Robustness with respect to missing values. Note that documents are 
returned at random when no attributes are available (0 on the x-axis). 



To study the robustness of the method with respect to missing
values, the following test was carried out.

• For each document $x_i$ in the database, 100 new documents
were generated as follows. Let $n_i$ be the number of attributes
available for $x_i$; each new example was obtained by removing a
number of attribute values randomly selected in
$\{0, 1, \ldots, n_i\}$.

• The precision at five obtained for these generated documents,
with respect to the number of available attributes, was plotted
in Fig. 8.
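
The document generation step of this test can be sketched as below. The dictionary representation of a document (with `None` marking a missing value), the function names, the uniform draw of the number of removed values, and the example attributes are all assumptions made for illustration; image signatures are reduced to a single number for brevity.

```python
import random
from typing import Dict, List, Optional

Document = Dict[str, Optional[float]]  # None marks a missing attribute value


def degrade(document: Document, rng: random.Random) -> Document:
    """Copy `document` after removing a random number of its available values."""
    available = [name for name, value in document.items() if value is not None]
    n_removed = rng.randint(0, len(available))  # drawn in {0, 1, ..., n_i}
    degraded = dict(document)
    for name in rng.sample(available, n_removed):
        degraded[name] = None
    return degraded


def generate_variants(document: Document, n_variants: int = 100,
                      seed: int = 0) -> List[Document]:
    """Generate the new documents derived from one original document."""
    rng = random.Random(seed)
    return [degrade(document, rng) for _ in range(n_variants)]


# Hypothetical patient file with one value already missing.
patient = {"age": 54.0, "breast_density": 2.0, "signature_RCC": 0.31, "signature_LCC": None}
variants = generate_variants(patient)
print(len(variants))  # 100
```

Each variant can then be submitted as a query, and the resulting precision at five grouped by the number of attributes that remain available.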

Finally, for comparison purposes, the proposed system was
applied to abnormal (“benign” or “cancer”) versus “normal”
document classification.

• For each document $x_i$ in the database (1364 abnormal and 695
normal), an abnormality index $a(x_i)$ was defined; $a(x_i)$ is the
percentage of abnormal documents among the topmost $K$ results (if
$x_i$ belongs to $V_j$, then the results are selected within the
database minus $V_j$).

• The receiver operating characteristic (ROC) curve [38] of
$a(\cdot)$ was plotted and the area under this curve, noted $A_z$,
was computed.
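
The two steps above can be sketched as follows. The abnormality index is expressed as a fraction rather than a percentage, and the area is the empirical one, computed with the pairwise (Mann-Whitney) formulation; this may differ from the estimator actually used for $A_z$. Function names and inputs are illustrative assumptions.

```python
from typing import Sequence


def abnormality_index(retrieved_is_abnormal: Sequence[bool], k: int) -> float:
    """Fraction of abnormal documents among the k top-ranked results."""
    return sum(retrieved_is_abnormal[:k]) / k


def roc_auc(scores: Sequence[float], is_abnormal: Sequence[bool]) -> float:
    """Empirical area under the ROC curve of `scores`.

    Equals the probability that an abnormal document receives a higher
    score than a normal one, ties counting for one half.
    """
    pos = [s for s, y in zip(scores, is_abnormal) if y]
    neg = [s for s, y in zip(scores, is_abnormal) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical top-ranked results for one query: True means "abnormal".
print(abnormality_index([True, False, True, True, False, True], 5))  # 0.6
```

In practice, `scores` would hold one abnormality index per document, each computed from results retrieved outside the document's own subset.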

An area under the ROC curve of $A_z = 0.921$, $A_z = 0.917$ and
$A_z = 0.914$ was obtained for $K = 5$, $K = 10$ and $K = 20$,
respectively. In comparison, for the task of classifying regions of
interest of 512 × 512 pixels (489 malignant masses, 412 benign
masses, and 919 normal breasts), Mazurowski et al. obtained an
area under the ROC curve of $A_z = 0.907 \pm 0.024$ using mutual
information [38].

V. Discussion and Conclusion 

In this paper, we introduced two methods to include image
series and their signatures, with contextual information, in a CBR
system. The first method uses a Bayesian network to model the
relationships between attributes. It allows us to manage missing
information, and to fuse several sources of information. In
particular, a method to include image signatures in a Bayesian
network was proposed. In this first method, we modeled the
relevance of a reference document in the database for the query,
according to a given attribute $A_i$, through the design of
conditional probabilities $P_q(A_i = a_{ij} \mid Q)$. The second
method, based on the Dezert-Smarandache theory, extends the first
one by improving the fusion operator: we modeled our confidence in
each estimation of the relevance through the design of belief mass
functions. These methods have been successfully applied to two
medical image databases. These methods are generic: they can be
extended to databases containing sound, video, etc. The wavelet
transform based signature, presented in Section II-C, can be
applied to any $n$-dimensional digital signal, using its
$n$-dimensional wavelet transform ($n = 1$ for sound, $n = 3$ for
video, etc.) [39]. Extending the proposed image signature to
$n$-dimensional wavelet transforms is trivial: characterizing the
distribution of wavelet coefficients simply implies iterating over
rows, columns, depth (or time), etc., instead of rows and columns
for a 2-D image (see Appendix A). The proposed methods are also
convenient in the sense that they do not need to be retrained each
time a new document is included in the database.
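
As an illustration of this dimension-agnostic iteration, the following minimal sketch estimates the scale parameter of Appendix A on a nested list of any depth, both directly from all coefficients and from a histogram restricted to $[-n\sigma; n\sigma]$. It assumes a fixed shape parameter `beta` (the Newton-Raphson refinement is omitted). The function names, the normalization by the total number of coefficients, and the synthetic Gaussian data are illustrative assumptions rather than the authors' implementation.

```python
import random
import statistics
from typing import Any, Iterator


def flatten(subband: Any) -> Iterator[float]:
    """Iterate over every coefficient, whatever the number of dimensions."""
    if isinstance(subband, (int, float)):
        yield float(subband)
    else:
        for part in subband:
            yield from flatten(part)


def alpha_direct(subband: Any, beta: float) -> float:
    """Scale estimate computed from every coefficient, for a fixed beta."""
    x = list(flatten(subband))
    return (beta / len(x) * sum(abs(v) ** beta for v in x)) ** (1.0 / beta)


def alpha_histogram(subband: Any, beta: float, n_bins: int = 64, n: float = 5.0) -> float:
    """Same estimate computed from bin counts and bin centroids only."""
    x = list(flatten(subband))
    limit = n * statistics.pstdev(x)
    width = 2.0 * limit / n_bins
    counts = [0] * n_bins
    for v in x:
        if -limit <= v < limit:  # coefficients outside the interval are ignored
            counts[min(int((v + limit) / width), n_bins - 1)] += 1
    centroids = [-limit + (k + 0.5) * width for k in range(n_bins)]
    total = sum(h * abs(c) ** beta for h, c in zip(counts, centroids))
    return (beta / len(x) * total) ** (1.0 / beta)


rng = random.Random(0)
# Synthetic stand-in for a 3-D subband (8 x 8 x 8 coefficients).
volume = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)] for _ in range(8)]
print(alpha_direct(volume, 2.0), alpha_histogram(volume, 2.0))
```

The costly per-coefficient power in `alpha_direct` is replaced by one evaluation per bin in `alpha_histogram`, which is where the speed-up reported in Table IV would come from under these assumptions.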

The precision at five obtained for DRD (0.809 ± 0.158) is
particularly interesting, considering the few examples available,
the large number of missing values and the large number of classes
taken into account. On this database, the methods outperform usual
methods by almost a factor of 2 in terms of precision at five
(0.809 compared to 0.430 for early fusion). The improvement is also
noticeable on DDSM (0.821 ± 0.177 compared to 0.714 ± 0.193). The
proposed retrieval methods are fast: most of the computation time
is spent during the image processing steps. The code may be
parallelized to decrease computation times further. Moreover,
sufficient precision can be reached before all the attributes are
provided by the user. As a consequence, the user can stop
formulating the query when the returned results are satisfactory.
On DRD for instance, a precision at five of 0.6 can be reached by
providing less than 30% of the attributes (see Fig. 8): with this
precision, the majority of the retrieved documents (3 out of 5)
belong to the right class. Table III shows that the difference, in
terms of retrieval performance, between single image retrieval [25]
and heterogeneous document retrieval, comes from the combination of
image features extracted from several images, more than the
inclusion of contextual attributes.

This study has three limitations. First, only one type of image
feature [25] has been included in the retrieval system (two for
DRD [25], [37]). In particular, the inclusion of
application-specific image features will have to be validated on
several medical image databases. Second, the reference standards
are affected by inter- and intra-observer variability; further
validation and observer studies are needed. Finally, as it has been
shown by Cheng et al., the size of the dataset has an influence on
the correctness of the generated Bayesian networks. DRD, in
particular, is small compared to the datasets used to validate
Bayesian network generation methods [24]. The limited size of the
dataset may also impact the performance on the test set, especially
if $K$ is larger than (or is in the order of) the number of cases
belonging to some of the classes within the dataset.

In conclusion, using appropriate information fusion operators,
heterogeneous case retrieval in medical digital databases is a
powerful tool to build reliable CADx systems. In future work, we
will try to improve retrieval performance further through the use
of relevance feedback [4] and through the inclusion of localized
image features. A web interface that will permit relevance feedback
is being developed to allow assessment of clinical usefulness by
physicians.



Appendix A

Fast Parameter Estimation for Generalized
Gaussian Distributions

In Do’s parameter estimation method [27], the parameters of
the wavelet coefficient distribution in an $M \times N$ subband
$X = \{x_{ij}, i = 1 \ldots M, j = 1 \ldots N\}$, namely $\alpha$
and $\beta$, are obtained by iterating over all coefficients in
this subband. For instance, $\alpha$ is obtained as follows:

$$\hat{\alpha} = \left( \frac{\hat{\beta}}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} |x_{ij}|^{\hat{\beta}} \right)^{1/\hat{\beta}} \qquad (17)$$

where $\hat{\beta}$ is an approximation of $\beta$, which is
iteratively refined using the Newton-Raphson procedure [27]. The
computation of $\hat{\beta}$ relies, for each wavelet coefficient,
on multiple evaluations of the logarithm and the digamma function,
which implies slow computations.

We propose to significantly reduce the number of such
evaluations by applying Do’s estimation method, not directly to
$X$, but to a histogram of $X$.

1) The standard deviation $\sigma$ of $X$ is computed.

2) A $B$-bin histogram of $X$, restricted to the
$[-n\sigma; n\sigma]$ interval, is computed (we used $B = 64$ and
$n = 5$; these numbers were chosen to reduce the approximation
error on an independent dataset$^1$).

3) Let $h_k$ be the number of coefficients assigned to the $k$-th
bin, and $v_k$ the centroid of that bin:

$$v_k = -n\sigma + \left( k - \frac{1}{2} \right) \frac{2 n \sigma}{B}. \qquad (18)$$

Equation (17) becomes

$$\hat{\alpha} = \left( \frac{\hat{\beta}}{MN} \sum_{k=1}^{B} h_k |v_k|^{\hat{\beta}} \right)^{1/\hat{\beta}}. \qquad (19)$$

All other equations in [27] are modified similarly.

Appendix B

$P_q(A_i = a_{ij} \mid Q)$: Computation Details

For each attribute $A_i$, $i = 1 \ldots M$, we want
$P_q(Q \mid A_i = a_{ij})$ to be proportional to
$r_{ij} = \sum_k \alpha_{ik}(x_q) S_{ijk}$ (see Section II-E). For
that purpose, we first determine
$p_i = P_q(Q \mid A_i = a_{i, \arg\max_j(r_{ij})})$. Let
$\tilde{r}_{ij} = r_{ij} / \max_k(r_{ik})$. The following
constraints have to be satisfied:

$$P_q(Q \mid A_i = a_{ij}) + P_q(\bar{Q} \mid A_i = a_{ij}) = 1 \qquad (20)$$

$$\sum_j P_q(Q \mid A_i = a_{ij}) P(A_i = a_{ij}) = P_q(Q) \qquad (21)$$

$$\sum_j P_q(\bar{Q} \mid A_i = a_{ij}) P(A_i = a_{ij}) = P_q(\bar{Q}) \qquad (22)$$

where $P_q(Q)$, $P_q(\bar{Q})$, and $P(A_i = a_{ij})$ are prior
probabilities. Injecting $p_i$ and $\tilde{r}_{ij}$ in (21), we
obtain

$$\sum_j p_i \, \tilde{r}_{ij} \, P(A_i = a_{ij}) = P_q(Q), \quad i = 1 \ldots M. \qquad (23)$$


$p_i$ is then extracted from (23):

$$p_i = \frac{P_q(Q)}{\sum_j \tilde{r}_{ij} \, P(A_i = a_{ij})}. \qquad (24)$$

Once $p_i$ is computed,
$P_q(\bar{Q} \mid A_i = a_{i, \arg\max_j(r_{ij})}) = 1 - p_i$ can
be computed [see (20)]. Other conditional probabilities are
deduced from the definition of $\tilde{r}_{ij}$:
$P_q(Q \mid A_i = a_{ij}) = p_i \, \tilde{r}_{ij}$.

If the most desirable state for attribute $A_i$
($\arg\max_j(r_{ij})$) is a rare state, it is possible that
$p_i > 1$. Indeed, in constraint (21),
$P_q(Q \mid A_i = a_{i, \arg\max_k(r_{ik})})$ is multiplied by a
small value ($P(A_i = a_{i, \arg\max_k(r_{ik})})$): the result of
this product is small and the other terms of the sum (with a value
$P_q(Q \mid A_i = a_{ij})$ smaller than
$P_q(Q \mid A_i = a_{i, \arg\max_k(r_{ik})})$ by definition) might
be too small for the sum to reach $P_q(Q)$. In that case, the
conditional probabilities should be changed as follows:

• we set $p_i = 1$,

• each $\tilde{r}_{ij}$, $j \neq \arg\max_k(r_{ik})$, is
multiplied by a constant $\gamma > 0$.

With this setup, constraint (21) becomes

$$P(A_i = a_{i, \arg\max_k(r_{ik})}) + \sum_{j \neq \arg\max_k(r_{ik})} \gamma \, \tilde{r}_{ij} \, P(A_i = a_{ij}) = P_q(Q). \qquad (25)$$

Finally, $\gamma$ is extracted from (26) and conditional
probabilities from (27)

$$\gamma = \frac{P_q(Q) - P(A_i = a_{i, \arg\max_j(r_{ij})})}{\sum_{j \neq \arg\max_k(r_{ik})} \tilde{r}_{ij} \, P(A_i = a_{ij})} \qquad (26)$$

$$P_q(Q \mid A_i = a_{ij}) = \gamma \, \tilde{r}_{ij}, \quad j \neq \arg\max_j(r_{ij}). \qquad (27)$$

The inequality $P_q(Q) > P(A_i = a_{i, \arg\max_k(r_{ik})})$
always holds; as a consequence, $\gamma > 0$. Indeed,
$P_q(Q) > P_q(Q \mid A_i = a_{i, \arg\max_k(r_{ik})}) \, P(A_i = a_{i, \arg\max_k(r_{ik})})$
[according to constraint (21)], i.e.,
$P_q(Q) > p_i \, P(A_i = a_{i, \arg\max_k(r_{ik})})$; given that
$p_i = 1$, the following inequality holds:
$P_q(Q) > P(A_i = a_{i, \arg\max_k(r_{ik})})$.

1 http://vismod.media.mit.edu/vismod/imagery/VisionTexture/vistex.html.